Method, system and module for multi-modal data fusion

ABSTRACT

A method for multi-modal data fusion ( 100 ), a multi-modal data fusion system ( 10 ) and a module ( 24 ) that in use operate by receiving segments ( 125 ) of multi-modal data, each segment being associated with a respective modality. A dynamically variable wait period is then initiated ( 130 ) after one of the segments is received. The dynamically variable wait period has a duration determined from data fusion timing statistics of the system ( 10 ). Waiting ( 140 ) for reception of any further segments during the dynamically variable wait period is then effected, and thereafter a fusing ( 145 ) of the segments received provides fused data that is sent ( 160 ) to a dialog manager ( 25 ).

FIELD OF THE INVENTION

[0001] This invention relates to a method, system and module for multi-modal data fusion. The invention is particularly useful for, but not necessarily limited to, real time multi-modal data fusion.

BACKGROUND ART

[0002] In interactive multi-modal data fusion systems, data, requests and commands are received and processed from a variety of input modalities including speech, text and graphics. The multi-modal data, requests and commands are combined and acted upon by a dialog manager. Known interactive multi-modal data fusion systems have a static wait period (for example 4 seconds) to combine multi-modal data, requests and commands and send them to the dialogue manager. In such multi-modal systems, the system waits to receive a first data packet, request or command from the user after the generation of a system response. Once the first data packet, request or command has been received from a modality, the data fusion system waits for the static time period to receive further data packets, requests or commands from other modalities. After the end of the static time period all the data packets, requests or commands received during that period are fused together using different methods, such as unification, agglomeration or otherwise.

[0003] This static wait period can provide an observable delay, and erroneous responses may be provided by the dialog manager as this period may be too short for specific users or for complex requests and commands. Furthermore, the static waiting period may be too long for certain input modalities, for example touch input, but too short for others, for instance speech input.

[0004] In this specification, including the claims, the terms ‘comprises’, ‘comprising’ or similar terms are intended to mean a non-exclusive inclusion, such that a method or apparatus that comprises a list of elements does not include those elements solely, but may well include other elements not listed.

SUMMARY OF THE INVENTION

[0005] According to one aspect of the invention there is provided a method for multi-modal data fusion in a multi-modal system comprising a plurality of input modalities, the method comprising the steps of:

[0006] (i) receiving one or more segments of multi-modal data, each of said segments being associated respectively with said modalities;

[0007] (ii) initiating a dynamically variable wait period after one of said segments is received, said dynamically variable wait period having a duration determined from data fusion timing statistics of the system;

[0008] (iii) waiting for reception of any further said segments during said dynamically variable wait period; and

[0009] (iv) fusing said segments received during said steps (i) to (iii) to provide fused data.

[0010] The method may suitably be characterized by repeating steps (ii) to (iv) if one or more of said further segments are received during said dynamically variable wait period.

[0011] Preferably, the method may include the further step of sending said fused data to a dialog manager.

[0012] Suitably, the duration of the timing statistics may be determined from historical statistical data of the system.

[0013] Preferably, the historical statistical data may include average segment creation time for each modality of said system.

[0014] Preferably, duration of the dynamically variable wait period can be further determined from analysis of a period starting after completion of receiving one of said segments and ending after completion of receiving a further one of said segments.

[0015] Preferably, duration of the dynamically variable wait period can be determined from analysis of the following:

[0016] min (max(AvgDur_(m)) or (AvgTimeDiff+AvgDur))

[0017] where AvgDur_(m) is a value indicative of the average duration associated with processing a user request into a one of said segments for a modality m, AvgTimeDiff is a value indicative of an average period for periods starting after completion of receiving one of said segments and ending after completion of receiving a further one of said segments, and AvgDur is a value indicative of the average duration associated with processing a user request into a one of said segments for every modality in the system.

[0018] Preferably, said segments can be frames.

[0019] Suitably, said frames may include temporal characteristics including at least part of said historical statistical data. The frames may also include semantic representations of an associated user request.

[0020] According to another aspect of the invention there is provided a multi-modal data fusion system comprising:

[0021] a plurality of modality processing modules;

[0022] a plurality of parsers coupled to a respective one of said modality processing modules;

[0023] a multi-modal fusion module having inputs coupled to outputs of said parsers, wherein in use said fusion module receives one or more segments of multi-modal data from at least one of said parsers, and initiates a dynamically variable wait period after one of said segments is received, said dynamically variable wait period having a duration determined from data fusion timing statistics of the system, the fusion module then waits for reception of any further said segments during said dynamically variable wait period and fuses said segments received to provide fused data.

[0024] Suitably, there is a dialogue manager coupled to an output of said fusion module.

[0025] Preferably, there are user input devices coupled respectively to said modality processing modules.

[0026] Suitably, the system may in use effect any or all combinations of the steps and characteristics of the above method.

[0027] According to another aspect of the invention there is provided a multi-modal fusion module having inputs for coupling to outputs of parsers, wherein in use said fusion module receives one or more segments of multi-modal data from at least one of said parsers, and initiates a dynamically variable wait period after one of said segments is received, said dynamically variable wait period having a duration determined from data fusion timing statistics of the system, the fusion module then waits for reception of any further said segments during said dynamically variable wait period and fuses said segments received to provide fused data.

[0028] Suitably, said fusion module may in use effect any or all combinations of the steps and characteristics of the above method.

BRIEF DESCRIPTION OF THE DRAWINGS

[0029] In order that the invention may be readily understood and put into practical effect, reference will now be made to a preferred embodiment as illustrated with reference to the accompanying drawings in which:

[0030] FIG. 1 is a schematic block diagram illustrating a multi-modal data fusion system in accordance with the invention;

[0031] FIG. 2 is a flow diagram illustrating a method for multi-modal data fusion implemented on the multi-modal system of FIG. 1; and

[0032] FIG. 3 is a timing diagram illustrating a dynamically variable wait period used in the method of FIG. 2.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT OF THE INVENTION

[0033] In the drawings, like numerals on different Figs are used to indicate like elements throughout. With reference to FIG. 1 there is illustrated a multi-modal data fusion system 10 comprising a plurality of parsers 28 (in the form of parsers 15, 17, 19, 21, 23) coupled to a respective one of a plurality of modality processing modules 29 (in the form of processing modules 14, 16, 18, 20, 22). In this embodiment, a speech processing module 14 is coupled to a text parser 15, a touch processing module 16 is coupled to a graphics parser 17, a scribing processing module 18 is coupled to a text parser 19, a string processing module 20 is coupled to a text parser 21 and a gesture processing module 22 is coupled to a gesture parser 23.

[0034] The multi-modal data fusion system 10 also includes a multi-modal fusion module 24 having inputs coupled to outputs of all the parsers 28. There is also a dialogue manager 25 coupled to an output of the fusion module 24 and there are input devices IDs in the form of a microphone 11, touch screen 12 and camera 13 coupled respectively to associated ones of the modality processing modules 29.

[0035] An output of the dialog manager 25 is coupled to an output unit 26 that provides a response to a user, the response being for instance an audible signal such as a synthesized voice, or visual data and information such as text or graphics. The output of the dialog manager 25 may be alternatively or also coupled to the touch screen 12 to provide visual data and information to the user.

[0036] Referring to FIG. 2 there is shown a flow diagram illustrating a method 100 for multi-modal data fusion implemented on the multi-modal system 10. Typically, at a start step 100, a user will actuate one of the input devices IDs, for example, the microphone 11 may be actuated by the user simply speaking a phrase “I want to go to”. A processing step 115 is then effected by the speech processing module 14 to provide a data string in the form of a text string that is parsed at a parsing step 120 by the text parser 15. Text parsers 15, 19 and 21 parse the input text strings and output semantic frames that include temporal information. A description of these text parsers 15, 19 and 21 can be found in James Allen, Natural Language Understanding (2nd Edition), Addison-Wesley, 1995 and is incorporated into this specification by reference.

[0037] The graphics parser 17 is similar to the text parser in that it uses the same grammar formalism to represent relationships between graphic objects. The graphics parser 17 interprets input from the touch processing module 16 and outputs semantic frames that include temporal information. The gesture parser 23 can be implemented similarly to the graphics parser 17, and parser 23 interprets output requests from the gesture processing module 22 to provide semantic frames. The speech processing module 14 processes speech signals and provides an output of a sequence of words or word graphs and is described in Lawrence Rabiner, Biing-Hwang Juang, Fundamentals of Speech Recognition, Prentice Hall PTR, 1993 and is incorporated into this specification by reference. The touch processing module 16 receives input data from touch screen 12 and maps the input data to objects that can be represented by a list of symbols and relationships between the symbols and provides an output of the list of symbols of mapped objects.

[0038] The scribing processing module 18 processes input requests and provides output requests as sequences of words. The scribing processing module 18 is described in Seong-Whan Lee, Advances in Handwriting Recognition, World Scientific, 1999 and is incorporated into this specification by reference. The string processing module 20 captures input characters and generates a sequence of words based on spaces between the characters. The gesture processing module 22 processes visual input and generates a sequence of symbols that represent a particular category of gestures. The gesture processing module 22 is described in Ming-Hsuan Yang, Narendra Ahuja, Face Detection and Gesture Recognition for Human-Computer Interaction (The Kluwer International Series in Video Computing, 1) Kluwer Academic Publishers, 2001 and is incorporated into this specification by reference.

[0039] After the parsing step 120 a receiving step 125 is effected, wherein the fusion module 24 effects receiving of one or more segments of multi-modal data, each of said segments being associated respectively with the modalities. For example, a segment in the form of a frame that includes a parsed form of “I want to go to” from text parser 15 is received at the fusion module 24. The frame also includes temporal information and comprises pairs of typed features and feature values; a feature value can be another nested frame. Below in Example 1 there is an example of a frame identified as Frame F1.

$\begin{bmatrix} \text{Type:} & \text{Go\_to\_location} \\ \text{Modality\_name:} & \text{Speech} \\ \text{startingPoint:} & 10{:}15{:}13.23 \\ \text{endingPoint:} & 10{:}15{:}15.00 \end{bmatrix}$

EXAMPLE 1 An Example of a Frame F1 Generated by the Text Parser

[0040] In Example 1, the text parser 15 parses the phrase “I want to go to” that is provided by the speech processing module 14 and generates a predicate “Go_to_location” (based on both domain grammar rules and semantics). The predicate is then mapped to the TYPE “Go_to_location” and the frame is identified as Frame F1.
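
By way of illustration only, a frame such as Frame F1 could be held in a small data structure like the following Python sketch. The class name `Frame`, its field names and the helper `to_seconds` are hypothetical and do not appear in the specification; the time stamps are assumed to have the form hours:minutes:seconds.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


def to_seconds(stamp: str) -> float:
    """Convert an 'HH:MM:SS.ss' time stamp (assumed format) to seconds since midnight."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


@dataclass
class Frame:
    """Hypothetical container for the typed feature/value pairs of a frame."""
    type: str
    modality_name: str
    starting_point: str
    ending_point: str
    # Further features; a value may itself be another Frame (nested frame).
    content: Dict[str, Any] = field(default_factory=dict)

    def duration(self) -> float:
        """Capture duration in seconds (ending point minus starting point)."""
        return to_seconds(self.ending_point) - to_seconds(self.starting_point)


# Frame F1 of Example 1
f1 = Frame("Go_to_location", "Speech", "10:15:13.23", "10:15:15.00")
```

With the time stamps of Example 1, `f1.duration()` evaluates to approximately 1.77 seconds.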

[0041] After Frame F1 is identified, the fusion module 24 effects an initiating step 130 for initiating a dynamically variable wait period after one of the segments is received (e.g. Frame F1). This dynamically variable wait period has a duration determined from data fusion timing statistics of the system 10, the data fusion timing statistics being historical statistical data of the system 10.

[0042] The fusion module 24 then enters a test step 135 and a waiting step 140 loop until either the dynamically variable wait period has expired or further segments are received. Upon the dynamically variable wait period expiring or if a further segment is received, the method 100 breaks out of the test step 135 and waiting step 140 loop. A fusing step 145 is effected by fusion module 24 to effect a fusing of any segments received during the above steps 125 to 140 to provide fused data. If it is determined at a test step 150 that one or more further segments were received during the dynamically variable wait period, then the method 100 returns to step 130 and the above steps 125 to 140 are repeated. Alternatively, if no further segments were received during the dynamically variable wait period, then an updating and storing timing statistics step 155 is effected and a send step 160 provides for sending the fused data to the dialog manager 25. The method then terminates at an end step 165.
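
The loop of steps 125 to 160 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the fusion module 24: the names `run_turn`, `merge_segments` and `wait_period` are hypothetical, segments are assumed to arrive on a standard `queue.Queue`, fusion is reduced to a plain dictionary merge, and the statistics update of step 155 is omitted.

```python
import queue
from typing import Any, Callable, Dict

Segment = Dict[str, Any]


def merge_segments(a: Segment, b: Segment) -> Segment:
    """Placeholder fusion (an assumption): features of b overwrite those of a."""
    merged = dict(a)
    merged.update(b)
    return merged


def run_turn(
    inbox: "queue.Queue[Segment]",
    wait_period: Callable[[], float],
    fuse: Callable[[Segment, Segment], Segment] = merge_segments,
) -> Segment:
    """Collect and fuse the segments of one user turn.

    wait_period() must return the current, non-negative wait duration in seconds.
    """
    fused = inbox.get()  # receiving step 125: block until the first segment arrives
    while True:
        try:
            # initiating step 130, then the test/waiting loop of steps 135 and 140
            further = inbox.get(timeout=wait_period())
        except queue.Empty:
            # wait period expired with no further segment: ready for send step 160
            return fused
        # fusing step 145; test step 150 then restarts the wait period
        fused = fuse(fused, further)
```

Because `wait_period` is called each time the wait is restarted, the duration can change between segments, which is the point of a dynamically variable wait period.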

[0043] The duration of the timing statistics, and wait periods, are determined from historical statistical data of the system as follows:

[0044] For each modality (m) input to one of the processing modules 29, the fusion module maintains:

[0045] A variable count C_(m), which holds the number of frames received from each modality m since startup;

[0046] A variable average duration AvgDur_(m);

[0047] A dynamically variable wait period time window (TW) that determines the maximum time difference between the start times of two pieces of information (contained in respective frames) that can be combined together;

[0048] A variable average duration, AvgDur; a variable average time difference, AvgTimeDiff;

[0049] A variable frame count, C, which holds the number of frames in the current user turn;

[0050] An end Capture Time of last frame, ECT, that is set to 0 at reset; and

[0051] A start Capture Time of last frame, SCT, that is set to 0 at reset.

[0052] Also, each input Frame, F_(n), contains at least:

[0053] A Modality Name, N; (type)

[0054] A Start of Capture time, SC_(n); (the starting point) and

[0055] An End of Capture time, EC_(n) (the ending point).

[0056] From the above definitions, the timing statistics are updated such that when an input frame, F_(n), from modality, m, is received:

[0057] a) AvgDur_(m) for that modality is updated using AvgDur_(m)=(C_(m)*AvgDur_(m)+(EC_(n)−SC_(n)))/(C_(m)+1)

[0058] b) AvgDur is recalculated using the weighted mean $\mathrm{AvgDur} = \frac{\sum_{m} C_{m} * \mathrm{AvgDur}_{m}}{\sum_{m} C_{m}}$

[0059] c) Count for that modality, C_(m) is incremented by 1

[0060] d) If (ECT !=0) then

[0061] a. AvgTimeDiff is updated using AvgTimeDiff=(C*AvgTimeDiff+(SC_(n)−ECT))/(C+1)

[0062] e) Frame count, C, is incremented by 1

[0063] f) ECT is set to EC_(n)

[0064] g) SCT is set to SC_(n)

[0065] h) The frame is stored within a collection of frames for the current user turn
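
Steps a) to h) can be sketched in Python as below. The class `TimingStats` and its attribute names are hypothetical. Two departures from the listed steps are assumptions of this sketch: the per-modality count is incremented (step c) before the weighted mean is recalculated (step b), so that the denominator is never zero on the first frame, and `None` rather than 0 marks an unset ECT or SCT.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import DefaultDict, List, Optional, Tuple


@dataclass
class TimingStats:
    """Hypothetical holder for the timing statistics kept by the fusion module."""
    count_m: DefaultDict[str, int] = field(default_factory=lambda: defaultdict(int))
    avg_dur_m: DefaultDict[str, float] = field(default_factory=lambda: defaultdict(float))
    avg_dur: float = 0.0        # AvgDur
    avg_time_diff: float = 0.0  # AvgTimeDiff
    frame_count: int = 0        # C
    ect: Optional[float] = None  # ECT; the text uses 0 as the unset value
    sct: Optional[float] = None  # SCT; the text uses 0 as the unset value
    turn: List[Tuple[str, float, float]] = field(default_factory=list)

    def on_frame(self, modality: str, sc: float, ec: float) -> None:
        """Update the statistics for a frame with capture times sc and ec (seconds)."""
        c_m = self.count_m[modality]
        # a) running mean of the capture duration for this modality
        self.avg_dur_m[modality] = (c_m * self.avg_dur_m[modality] + (ec - sc)) / (c_m + 1)
        # c) increment the modality count (done before b in this sketch, see lead-in)
        self.count_m[modality] = c_m + 1
        # b) weighted mean of the per-modality averages
        total = sum(self.count_m.values())
        self.avg_dur = sum(self.count_m[m] * self.avg_dur_m[m] for m in self.count_m) / total
        # d) running mean of the gap between the previous frame's end and this frame's start
        if self.ect is not None:
            self.avg_time_diff = (
                self.frame_count * self.avg_time_diff + (sc - self.ect)
            ) / (self.frame_count + 1)
        # e) frame count
        self.frame_count += 1
        # f), g) remember the capture times of the last frame
        self.ect, self.sct = ec, sc
        # h) store the frame for the current user turn
        self.turn.append((modality, sc, ec))
```

For example, a speech frame captured from 10.0 s to 12.0 s followed by a touch frame captured from 12.5 s to 13.5 s leaves AvgDur at 1.5 s and AvgTimeDiff at 0.25 s.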

[0066] If no input is received and a time equal to the time window TW has elapsed after the current value of SCT, then integration is performed with all the frames received in the current turn.

[0067] When integration is started

[0068] a. ECT is reset to 0

[0069] b. SCT is reset to 0

[0070] c. The dynamically variable wait period time window TW is therefore updated using the following:

[0071] TW=min (max(AvgDur_(m)) or (AvgTimeDiff+AvgDur))

[0072] where AvgDur_(m) is a value indicative of the average duration associated with processing a user request into a one of said segments for a modality m, AvgTimeDiff is a value indicative of an average period for periods starting after completion of receiving one of said segments and ending after completion of receiving a further one of said segments, and AvgDur is a value indicative of the average duration associated with processing a user request into a one of said segments for every modality in the system.
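
The time window update and the expiry test can be sketched in Python as follows. The function names are hypothetical, and the sketch reads the expression for TW as the smaller of two candidates, namely the largest per-modality average duration and the sum AvgTimeDiff+AvgDur; that reading of the "or" in the expression is an assumption.

```python
from typing import Mapping


def wait_window(
    avg_dur_by_modality: Mapping[str, float],
    avg_time_diff: float,
    avg_dur: float,
) -> float:
    """TW as the smaller of max over m of AvgDur_(m) and AvgTimeDiff + AvgDur.

    Raises ValueError if no modality has been observed yet (empty mapping).
    """
    longest = max(avg_dur_by_modality.values())
    return min(longest, avg_time_diff + avg_dur)


def window_expired(now: float, sct: float, tw: float) -> bool:
    """True once a time equal to TW has elapsed after the start capture time SCT."""
    return now - sct >= tw


# Illustrative values only (not taken from the specification)
tw = wait_window({"speech": 2.0, "touch": 1.0}, avg_time_diff=0.25, avg_dur=1.5)
```

With these illustrative values the largest per-modality average is 2.0 s and AvgTimeDiff+AvgDur is 1.75 s, so `tw` is 1.75 s.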

[0073] In a more generic sense $TW(t) = f(TW|_{0}^{t-1}) \mid m$ where m represents the modalities. Thus, the time window TW at time t is an ergodic process that is conditioned by modality factors such as modality state or ambiguity in information. Hence, the time window TW may be affected not only by the input times but also by the modality m. Further, the historical statistical data typically includes average segment creation time for each modality m of the system 10. In addition, the duration of the dynamically variable wait period time window TW is further determined from analysis of a period starting after completion of receiving one of the segments and ending after completion of receiving a further one of the segments.

[0074] Referring to FIG. 3, there is a timing diagram illustrating dynamically variable wait period time windows TWs. For instance, at a time t₀, the speech processing module 14 starts processing a speech input (from microphone 11) such as “I want to go to” to provide a data string in the form of a text string Ts. Then, at time t₁, the text parser 15 has parsed the text string Ts and provided the frame F₁. Also at time t₁, the fusion module 24 initiates a dynamically variable wait period time window TW₁.

[0075] At a time t₂, the touch processing module 16 is actuated to start processing x,y co-ordinates of a map displayed on the touch screen 12 to provide a data string in the form of a graphics string Gs, where the x,y co-ordinates are selected by a user touching the touch screen 12. Then at a time t₃ the graphics parser 17 has parsed the graphics string Gs and provided a frame F2 and the wait period TW₁ has not expired, therefore at time t₃ the fusion module 24 initiates another dynamically variable wait period TW₂. An example of frame F2 generated by the graphics parser 17 is given below in Example 2 in which x,y co-ordinates of a map on touch screen 12 are selected by a user. The frame F2 is generated as Type Hotel by a map database associated with parser 17 that associates the selected x,y co-ordinates with a Hotel that is known as the Grace Hotel.

$\begin{bmatrix} \text{Type:} & \text{Hotel} \\ \text{Modality\_Name:} & \text{Touch} \\ \text{startingPoint:} & 10{:}15{:}13.33 \\ \text{endingPoint:} & 10{:}15{:}14.00 \\ \text{Content:} & \\ \text{Name:} & \text{Grace Hotel} \\ \text{Location:} & \begin{bmatrix} \text{Type:} & \text{Street\_Location} \\ \text{Street\_no:} & 12 \\ \text{Street\_name:} & \text{Lord Street} \\ \text{Suburb:} & \text{Botany} \\ \text{Zip:} & 2019 \end{bmatrix} \end{bmatrix}$

EXAMPLE 2 An Example of a Frame F2 Generated by the Graphics Parser

[0076] Assuming that at a time t₄ the dynamically variable wait period time window TW₂ has expired, the data in frame F₁ and frame F₂ are fused and sent to the dialog manager 25.
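
The restarting of the wait period in the timing sequence of FIG. 3 can be mimicked offline with the following Python sketch. The function `simulate_turn` is hypothetical, the numeric times are illustrative only, and a single constant window length is assumed for simplicity, whereas TW₁ and TW₂ in the embodiment may differ.

```python
from typing import List, Sequence, Tuple


def simulate_turn(arrivals: Sequence[float], window: float) -> Tuple[List[float], float]:
    """Return the arrival times fused in the first turn and the time the wait expires.

    Each frame that arrives before the current wait period expires restarts the
    wait period from its own arrival time; the first frame that arrives later
    belongs to a subsequent turn.
    """
    ordered = sorted(arrivals)
    if not ordered:
        raise ValueError("at least one frame arrival is required")
    fused = [ordered[0]]
    deadline = ordered[0] + window       # first wait period starts at the first arrival
    for t in ordered[1:]:
        if t > deadline:                 # wait period already expired
            break
        fused.append(t)
        deadline = t + window            # restart the wait period
    return fused, deadline


# Frames arriving at 1.0 s and 2.5 s (standing for t1 and t3), a later frame at 7.0 s
fused, expiry = simulate_turn([1.0, 2.5, 7.0], window=2.0)
```

Here the second frame arrives before the first wait period ends at 3.0 s, so the wait restarts and expires at 4.5 s (standing for t₄); the frame at 7.0 s is left for a later turn.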

[0077] Advantageously, the dynamically variable wait period used in the present invention alleviates or reduces observable delays that can be caused by static wait periods used in conventional multi-modal data fusion systems. In use, the fusion module 24 of the present invention receives one or more segments of multi-modal data, in the form of frames, from at least one of the parsers 28 and initiates a dynamically variable wait period after one of the segments is received. The dynamically variable wait period has a duration determined from data fusion timing statistics of the system 10. The fusion module 24 then waits for reception of any further segments during the dynamically variable wait period and fuses said segments received to provide fused data that is sent to the dialog manager 25. A completed set of fused data received during the steps of the above method 100 results in a response being provided to a user by either or both of the touch screen 12 and the output unit 26.

[0078] The detailed description provides a preferred exemplary embodiment only, and is not intended to limit the scope, applicability, or configuration of the invention. Rather, the detailed description of the preferred exemplary embodiment provides those skilled in the art with an enabling description for implementing a preferred exemplary embodiment of the invention. It should be understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the invention as set forth in the appended claims.

We claim:
 1. A method for multi-modal data fusion in a multi-modal system comprising a plurality of input modalities, the method comprising the steps of: (i) Receiving one or more segments of multi-modal data, each of said segments being associated respectively with said modalities; (ii) Initiating a dynamically variable wait period after one of said segments is received, said dynamically variable wait period having a duration determined from data fusion timing statistics of the system; (iii) Waiting for reception of any further said segments during said dynamically variable wait period; and (iv) Fusing said segments received during said steps (i) to (iii) to provide fused data.
 2. A method for multi-modal data fusion, as claimed in claim 1, further characterized by repeating steps (ii) to (iv) if one or more of said further segments are received during said dynamically variable wait period.
 3. A method for multi-modal data fusion, as claimed in claim 1, including the further step of sending said fused data to a dialog manager.
 4. A method for multi-modal data fusion, as claimed in claim 1, wherein the duration of the timing statistics are determined from historical statistical data of the system.
 5. A method for multi-modal data fusion, as claimed in claim 4, wherein the historical statistical data includes average segment creation time for each modality of said system.
 6. A method for multi-modal data fusion, as claimed in claim 5, wherein the duration of the dynamically variable wait period is further determined from analysis of a period starting after completion of receiving one of said segments and ending after completion of receiving a further one of said segments.
 7. A method for multi-modal data fusion, as claimed in claim 1, wherein the duration of the dynamically variable wait period is determined from analysis of the following: min(max(AvgDur_(m)) or (AvgTimeDiff+AvgDur)) where AvgDur_(m) is a value indicative of the average duration associated with processing a user request into a one of said segments for a modality m, AvgTimeDiff is a value indicative of an average period for periods starting after completion of receiving one of said segments and ending after completion of receiving a further one of said segments, and AvgDur is a value indicative of the average duration associated with processing a user request into a one of said segments for every modality in the system.
 8. A method for multi-modal data fusion, as claimed in claim 1, wherein said segments are frames.
 9. A method for multi-modal data fusion, as claimed in claim 8, wherein said frames include temporal characteristics including at least part of said historical statistical data. The frames may also include semantic representations of an associated user request.
 10. A multi-modal data fusion system comprising: a plurality of modality processing modules; a plurality of parsers coupled to a respective one of said modality processing modules; a multi-modal fusion module having inputs coupled to outputs of said parsers, wherein in use said fusion module receives one or more segments of multi-modal data from at least one of said parsers, and initiates a dynamically variable wait period after one of said segments is received, said dynamically variable wait period having a duration determined from data fusion timing statistics of the system, the fusion module then waits for reception of any further said segments during said dynamically variable wait period and fuses said segments received to provide fused data.
 11. A multi-modal data fusion system as claimed in claim 10, wherein there is a dialogue manager coupled to an output of said fusion module.
 12. A multi-modal data fusion system as claimed in claim 10, wherein there are user input devices coupled respectively to said modality processing modules.
 13. A multi-modal data fusion system as claimed in claim 10, wherein the duration of the timing statistics are determined from historical statistical data of the system.
 14. A multi-modal data fusion system as claimed in claim 13, wherein the historical statistical data includes average segment creation time for each modality of said system.
 15. A multi-modal data fusion system as claimed in claim 14, wherein the duration of the dynamically variable wait period is further determined from analysis of a period starting after completion of receiving one of said segments and ending after completion of receiving a further one of said segments.
 16. A multi-modal data fusion system as claimed in claim 10, wherein the duration of the dynamically variable wait period is determined from analysis of the following: min(max(AvgDur_(m)) or (AvgTimeDiff+AvgDur)) where AvgDur_(m) is a value indicative of the average duration associated with processing a user request into a one of said segments for a modality m, AvgTimeDiff is a value indicative of an average period for periods starting after completion of receiving one of said segments and ending after completion of receiving a further one of said segments, and AvgDur is a value indicative of the average duration associated with processing a user request into a one of said segments for every modality in the system.
 17. A multi-modal fusion module having inputs for coupling to outputs of parsers, wherein in use said fusion module receives one or more segments of multi-modal data from at least one of said parsers, and initiates a dynamically variable wait period after one of said segments is received, said dynamically variable wait period having a duration determined from data fusion timing statistics of the system, the fusion module then waits for reception of any further said segments during said dynamically variable wait period and fuses said segments received to provide fused data.
 18. A multi-modal fusion module as claimed in claim 17, wherein the duration of the timing statistics are determined from historical statistical data of the system.
 19. A multi-modal fusion module as claimed in claim 18, wherein the historical statistical data includes average segment creation time for each modality of said system.
 20. A multi-modal fusion module as claimed in claim 19, wherein the duration of the dynamically variable wait period is further determined from analysis of a period starting after completion of receiving one of said segments and ending after completion of receiving a further one of said segments.
 21. A multi-modal fusion module as claimed in claim 17, wherein the duration of the dynamically variable wait period is determined from analysis of the following: min(max(AvgDur_(m)) or (AvgTimeDiff+AvgDur)) where AvgDur_(m) is a value indicative of the average duration associated with processing a user request into a one of said segments for a modality m, AvgTimeDiff is a value indicative of an average period for periods starting after completion of receiving one of said segments and ending after completion of receiving a further one of said segments, and AvgDur is a value indicative of the average duration associated with processing a user request into a one of said segments for every modality in the system.